Abstract. We give several modifications of the Goulden-Jackson Cluster
method for finding generating functions for words avoiding a given set of forbidden
words. Our modifications include functions which can take into account
various 'weights' on words, including single letter probability distributions, 
double letter (i.e. pairwise) probability distributions, and triple letter prob- 
ability distributions. We also describe an alternative, recursive approach to 
finding such generating functions. We describe Maple implementations of the 
various modifications. The accompanying Maple package is available at the 
website for this paper. 



1. Introduction 

Suppose we are given a finite alphabet, and a finite set of forbidden words in 
this alphabet. We would like to know how many n-letter words in our alphabet 
avoid the forbidden words as subwords or factors, i.e. strings of consecutive letters. 
In order to do this we will find the generating function for the number of such 
words. In Section 2 we describe a straightforward, recursive approach to solving
this using ordinary generating functions. The remainder of this article deals with
the Goulden-Jackson cluster method, a powerful method that considers overlap
between forbidden factors in computing generating functions. The Goulden-Jackson
cluster method was introduced in [GJ79] and [GJ83] and described very clearly
and concisely in [NZ99]. For earlier work see [GO81], and further extensions can be
found in [Kon05]. Applications of the Goulden-Jackson cluster method to genomics
can be found in [HXYC00b], [XH02], and [HXYC00a]. In Section 3 we review the
classical Goulden-Jackson cluster method. We then describe some modifications
to the original Goulden-Jackson problem as follows: In Section 4 we modify the
Goulden-Jackson cluster method to take single letter weights into account. In
Section 5 we include double letter (i.e. pairwise) weights, and in Section 6 we consider
triple letter weights. Further generalizations are described in Section 7.

All the methods discussed have been implemented in a Maple package which 
accompanies this paper. The Maple package, which includes documentation, can 
be found at the website for this paper.
2. The Straightforward Recursive Approach

Given a finite alphabet A and a set of forbidden or 'bad' words B, we would like 
to find a(n), the number of n-letter words in the alphabet A that do not contain any
members of B as factors. Rather than find a(n) directly, we will find the generating 
function 

(1)    $f(t) = \sum_{n=0}^{\infty} a(n)\, t^n$.

In order to find this generating function, we need not use the Goulden-Jackson 
cluster method, which will be described in Section 3. We can use a straightforward
recursive approach, though as we will demonstrate, this method will not be as 
efficient as Goulden-Jackson. The approach contained in this section was described 
by Dr. Doron Zeilberger in his Spring 2008 Experimental Math class at Rutgers 
University. 
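As a point of reference for everything that follows, the coefficients a(n) can be computed for small n by direct enumeration. The sketch below is not part of the accompanying Maple package; the function name count_avoiding is hypothetical, and the approach is exponential in n, so it is only meant for checking small cases such as the example used in this section.

```python
from itertools import product

def count_avoiding(alphabet, bad, n):
    """Count n-letter words over `alphabet` containing no word of `bad` as a factor."""
    total = 0
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        # A factor is a block of consecutive letters, i.e. a substring.
        if not any(b in word for b in bad):
            total += 1
    return total

# Running example of this section: A = {a, b}, B = {abb, ba}.
counts = [count_avoiding("ab", ["abb", "ba"], n) for n in range(6)]
print(counts)  # [1, 2, 3, 3, 3, 3]
```

For n >= 2 the three allowed words are a^n, a^(n-1)b and b^n, which is why the count stabilizes at 3.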

We first illustrate the approach with an example. Suppose A = {a, b} and 
B = {abb,ba}. We will start by decomposing the set of allowed words according 
to their first letter. Consider the set of allowed words beginning with a. Any such 
word is either a itself, or consists of a followed by a smaller word starting with either 
a or b. What are the restrictions on the smaller word, following the initial letter 
a? If it starts with a, it must still avoid abb and ba, but there are no additional 
restrictions. However, if the word following the initial letter a begins with b, it 
must avoid abb and ba, but in addition it must not begin with bb, so as to avoid the 
forbidden word abb. This gives a correspondence among different sets of words. 

Denote by $[B, a_i, \{w_1, \ldots, w_n\}]$ the set of words avoiding members of B, starting
with $a_i$, and avoiding any word in $\{w_1, \ldots, w_n\}$ as an initial subword. Translating
the above example into this notation, we have:

(2)    $[B, a, \{\}] \leftrightarrow \{a\} \cup [B, a, \{\}] \cup [B, b, \{bb\}]$,

where the latter two terms on the right hand side describe the allowable subwords 
following the initial letter a. 

Now let us consider the set of allowable words beginning with b. Either the word 
is b itself, or consists of b followed by a smaller word starting with either a or b.
Of course the letter following the initial b cannot be a, since this would form the 
forbidden word ba, but rather than exclude this a priori, we will instead say that 
any word following the initial b and starting with a must not begin with the word 
a. Of course no words satisfy this condition so the corresponding set will be empty. 
Any word following the initial b that starts with b must avoid the forbidden words 
in B but has no additional restrictions. To summarize, we have 

(3)    $[B, b, \{\}] \leftrightarrow \{b\} \cup [B, a, \{a\}] \cup [B, b, \{\}]$.

We can express the allowable words with the decomposition

(4)    $[B, *, \{\}] \leftrightarrow \{\mathrm{empty\_word}\} \cup [B, a, \{\}] \cup [B, b, \{\}]$,

where $[B, *, \{\}]$ denotes the set of all words in alphabet A avoiding the words in B
(with no additional restrictions). 

Eventually we will turn these correspondence relations into equations, by taking 
a weighted count of the set elements. First, in order to solve explicitly for the latter 
two sets in the above relation, we must decompose the remaining sets on the right 
hand sides of (2) and (3), as well as any new sets arising from those relations. This
leads to the following:

$[B, b, \{bb\}] \leftrightarrow \{b\} \cup [B, a, \{a\}] \cup [B, b, \{b\}]$
$[B, a, \{a\}] \leftrightarrow \emptyset$
$[B, b, \{b\}] \leftrightarrow \emptyset$.
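The decomposition can be turned into a small counting recursion. The following is a minimal sketch, not the RecursiveSingle code from the Maple package: it counts words only (all weights set to 1), and the names make_counter and count are hypothetical. The state (first, extra) plays the role of the set $[B, \mathrm{first}, \mathrm{extra}]$, with `extra` holding the additional forbidden initial subwords.

```python
from functools import lru_cache

def make_counter(alphabet, bad):
    """Return a(n) computed via the sets [B, first, extra] described above."""
    bad = frozenset(bad)

    @lru_cache(maxsize=None)
    def count(first, extra, n):
        # Number of n-letter words starting with `first`, avoiding `bad` as
        # factors and avoiding every word of `extra` as an initial subword.
        if n == 0 or first in extra or first in bad:
            return 0
        if n == 1:
            return 1
        # Strip the first letter: words of bad/extra starting with `first`
        # become forbidden initial subwords of the remaining word.
        pending = {w[1:] for w in bad | extra if len(w) > 1 and w[0] == first}
        total = 0
        for c in alphabet:
            restrictions = frozenset(w for w in pending if w[0] == c)
            total += count(c, restrictions, n - 1)
        return total

    def a(n):
        if n == 0:
            return 1  # the empty word
        return sum(count(c, frozenset(), n) for c in alphabet)

    return a

a = make_counter("ab", ["abb", "ba"])
print([a(n) for n in range(6)])  # [1, 2, 3, 3, 3, 3]
```

For the running example, the state reached after an initial a followed by b is ("b", {"bb"}), matching the set $[B, b, \{bb\}]$ above.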

If one simply wants to know the number of allowable n-letter words, a generating 
function such as the one given in equation (1) can be found. However, it is possible
to find variant generating functions which give more information about the
allowable words. These variants will be described in later sections. In order to find
the various generating functions, we will make use of a weighted counting system,
performing a weighted count of the words in the sets above. Taking weights in
relation (4) gives the desired generating function. The particular 'weight' used varies based
on the method (to be described in the following sections) so we postpone further 
calculations. It is worth pointing out, however, that one must be careful not to 
merely sum the weights of the sets on the right hand sides of the correspondence 
relations. Rather, it is necessary to account for the weights of the truncated initial 
letters, as well as any transition weights that may arise. See Example 1 for further
details. 

The Maple code for this recursive method can be found in our accompanying 
Maple package under the function names RecursiveSingle, RecursiveDouble, 
and RecursiveProbDouble. These functions implement the straightforward recursive
analogues of the cluster method generalizations to be described in Sections 4,
5, and 7.1, respectively.

3. Basic Goulden-Jackson Cluster Method

We borrow from [NZ99] in briefly reviewing the basic Goulden-Jackson clus- 
ter method, and encourage the reader to consult this source for a more detailed 
exposition. 

In order to find the generating function given in equation (1), we will do a
weighted count of marked words. A marked word is a pair (w;S), where w is a 
word in the alphabet A, and S is an arbitrary multiset whose entries are members 
of Bad(w), the forbidden words of B contained as factors in w. We allow repetition 
in S since a word may contain several copies of a given forbidden word. If no subset 
S is specified, we assume it is the empty set. We define the weight of a marked
word as $\mathrm{weight}(w; S) = (-1)^{|S|} t^{|w|}$, where $|S|$ is the cardinality of S and $|w|$ is the
length of w. The weight of a set of marked words is obtained by summing the
weights of the marked words in the set. The generating function from equation (1)
now becomes

(5)    $f(t) = \sum_{w \in A^*} \sum_{S \subseteq \mathrm{Bad}(w)} (-1)^{|S|} t^{|w|}$,

where A* is the set of all words in the alphabet A. In order to see why this is 
valid, consider an arbitrary word w in our alphabet. This word will appear as the 
first argument in $2^k$ marked words, where k is the number of forbidden subwords
contained in w. In equation (5), we sum over all possible subsets of Bad(w). Thus
if w contains no forbidden subwords, it will be counted exactly once in the above
sum. If w contains k forbidden subwords, k > 0, the number of times it will be
counted in the above sum is:

$\sum_{i=0}^{k} (-1)^i \binom{k}{i} = (1 + (-1))^k = 0$.

Thus every allowable word is counted once, while words containing forbidden words 
are not counted. We have verified the equivalence of equation (5) and our original
generating function, given in equation (1).
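The cancellation argument can be checked numerically for small cases. The sketch below, with hypothetical function names, sums the signed weights of all marked words of a given length (with t set aside, since all words in the sum have the same length) and should reproduce the number of allowable words. Here an element of Bad(w) is taken to be an occurrence, i.e. a forbidden word together with its position in w.

```python
from itertools import product
from math import comb

def occurrences(word, bad):
    """All (position, forbidden word) pairs at which a forbidden word occurs in `word`."""
    return [(i, b) for b in bad
            for i in range(len(word) - len(b) + 1)
            if word.startswith(b, i)]

def signed_count(alphabet, bad, n):
    """Sum of (-1)^|S| over all marked words (w; S) with |w| = n."""
    total = 0
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        k = len(occurrences(word, bad))
        # There are C(k, i) subsets S of size i, each contributing (-1)^i.
        total += sum((-1) ** i * comb(k, i) for i in range(k + 1))
    return total

print([signed_count("ab", ["abb", "ba"], n) for n in range(5)])  # [1, 2, 3, 3, 3]
```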

We call a marked word (w; S) a cluster if neighboring factors in S overlap (i.e. are 
not disjoint) in w, and the forbidden words of S span all of w. For example, if our
alphabet A is {a, b, c} and the set of forbidden words B is {ba, aca}, then the marked 
word (bacaca; {ba, aca, aca}) is a cluster. The marked word (bacaca; {aca, aca}) is 
not a cluster because the factors in S do not span all of w (the first b is not part of 
a factor in S), and the marked word (acaba; {aca, ba}) is not a cluster because the 
factors in S do not overlap. We will denote the set of all (nonempty) clusters by C. 
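The cluster condition is easy to test mechanically. In the sketch below (an illustration only, with the hypothetical name is_cluster), a marked word is represented as a word together with a list of (start position, forbidden word) pairs; this representation is an assumption made for the example, since the text writes S simply as a multiset of forbidden words.

```python
def is_cluster(word, marks):
    """Decide whether (word; marks) is a cluster.

    marks: list of (start, forbidden_word) pairs, the marked occurrences in `word`.
    """
    if not marks:
        return False
    marks = sorted(marks)
    # Every marked factor must really occur at its stated position.
    if any(not word.startswith(b, i) for i, b in marks):
        return False
    # The marked factors must span the word from its first to its last letter.
    if marks[0][0] != 0 or max(i + len(b) for i, b in marks) != len(word):
        return False
    # Neighboring marked factors must share at least one letter.
    return all(nxt[0] < prev[0] + len(prev[1])
               for prev, nxt in zip(marks, marks[1:]))

print(is_cluster("bacaca", [(0, "ba"), (1, "aca"), (3, "aca")]))  # True
print(is_cluster("bacaca", [(1, "aca"), (3, "aca")]))             # False
print(is_cluster("acaba", [(0, "aca"), (3, "ba")]))               # False
```

The three calls correspond to the three marked words discussed in the paragraph above.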

We now decompose M, the set of marked words, into three groups: the empty 
word, marked words beginning with a letter that is not part of any cluster, and 
marked words beginning with a cluster. We thus obtain the decomposition 

$M = \{\mathrm{empty\_word}\} \cup AM \cup CM$,

where an element of AM consists of a single letter of the alphabet A prepended to 
a marked word and an element of CM consists of a cluster prepended to a marked 
word. Let m be the number of letters in A. By taking weights on both sides of the 
preceding equation, we obtain 

$\mathrm{weight}(M) = 1 + mt \cdot \mathrm{weight}(M) + \mathrm{weight}(C)\,\mathrm{weight}(M)$.

But weight(M) equals f(t), as shown in equation (5), so by substituting and solving
for f(t) we get

(6)    $f(t) = \frac{1}{1 - mt - \mathrm{weight}(C)}$,

and it remains to solve for weight(C), which we will call the cluster generating 
function. 

In order to find weight (C), we partition the set of clusters C according to the 
first forbidden word of the cluster. Let C[v] denote the set of clusters starting with 
forbidden word v. Then $C = \bigcup_{v \in B} C[v]$, and $\mathrm{weight}(C) = \sum_{v \in B} \mathrm{weight}(C[v])$.

In order to find weight(C[v]), we will further decompose C[v] as follows: consider
a cluster in C[v]. Either it consists of v alone, or we can remove v from the
list of forbidden words in our marked cluster, and what remains will contain a 
smaller cluster, beginning with some bad word u such that some initial subword 
of u coincides with some final subword of v. For example, consider the cluster 
(bacaca; {ba, aca, aca}), where the alphabet and forbidden word set are as above. 
This cluster is in C[ba]. Removing the initial forbidden word ba leaves a new cluster
(acaca; {aca, aca}) in C[aca]. Let O(v, u) be the set of possible 'overlaps' of v and
u, that is, all possible (nonempty) intersections of final subwords of v with initial 
subwords of u. Each of these overlaps corresponds to a way that our cluster can 
have its first two forbidden words be v and u, respectively. To create the smaller 
cluster we will peel off exactly the part of v that does not overlap with u. For any 
word v and a final subword r of v, $v \backslash r$ will denote the word obtained by chopping
r from the end of v. For example, $abcb \backslash cb = ab$. This leads to the decomposition

$C[v] \leftrightarrow \{(v; \{v\})\} \cup \bigcup_{u \in B} \bigcup_{r \in O(v,u)} \big( (v \backslash r) \cdot C[u] \big)$,

where $W_1 \cdot W_2$ is the concatenation of $W_1$ with $W_2$. Taking weights, we obtain the
following linear equations. Note that the sum over $u \in B$ is negative (i.e. multiplied
by $-1$) in order to compensate for having reduced the number of bad words in our
cluster by one (because we are calculating weights of clusters containing one fewer
bad word than the clusters in C[v]).

$\mathrm{weight}(C[v]) = \mathrm{weight}((v; \{v\})) - \sum_{u \in B} \Big( \mathrm{weight}(C[u]) \cdot \sum_{r \in O(v,u)} \mathrm{weight}(v \backslash r) \Big)$

We can explicitly calculate O(v, u), and so by writing this equation for C[v] for
all $v \in B$, we obtain a sparse system of $|B|$ linear equations in $|B|$ unknowns.
Solving for the weight(C[v]) and summing them gives us weight(C), which can then
be substituted into equation (6), giving the desired generating function.
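The whole procedure can be sketched with truncated power series in t, using only integer arithmetic. This is an illustrative sketch, not the Maple implementation: the names overlaps and gj_counts are hypothetical, the linear system is solved by fixed-point iteration rather than by symbolic elimination (valid here because every term $\mathrm{weight}(v \backslash r)$ has degree at least 1 in t), and it assumes that no forbidden word is a factor of another.

```python
def overlaps(v, u):
    """Lengths k of the overlaps in O(v, u): suffixes of v that are prefixes of u."""
    return [k for k in range(1, min(len(v), len(u))) if v[-k:] == u[:k]]

def gj_counts(alphabet, bad, N):
    """Coefficients a(0..N) of f(t) = 1 / (1 - m t - weight(C)), as in equation (6)."""
    m = len(alphabet)
    # cluster[v][n] is the coefficient of t^n in weight(C[v]).
    cluster = {v: [0] * (N + 1) for v in bad}
    for _ in range(N + 1):  # each pass fixes at least one more coefficient
        new = {}
        for v in bad:
            series = [0] * (N + 1)
            if len(v) <= N:
                series[len(v)] -= 1  # weight((v; {v})) = -t^{|v|}
            for u in bad:
                for k in overlaps(v, u):
                    shift = len(v) - k  # |v \ r|
                    for n in range(N + 1 - shift):
                        series[n + shift] -= cluster[u][n]
            new[v] = series
        cluster = new
    C = [sum(cluster[v][n] for v in bad) for n in range(N + 1)]
    # Invert the series: f * (1 - m t - C) = 1, with C[0] = 0.
    f = [0] * (N + 1)
    f[0] = 1
    for n in range(1, N + 1):
        f[n] = m * f[n - 1] + sum(C[j] * f[n - j] for j in range(1, n + 1))
    return f

print(gj_counts("ab", ["abb", "ba"], 5))   # [1, 2, 3, 3, 3, 3]
print(gj_counts("abc", ["ba", "aca"], 3))  # [1, 3, 8, 20]
```

Working with truncated series avoids any dependence on a computer algebra system; a symbolic solver would instead return the rational function itself.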

Variations of the Basic Cluster Method. By changing how the weight of a 
word is defined, we can alter the interpretation of the resulting generating function. 
In the following sections we present several such variations. All the variations keep 
track of how many words of each length avoid the set of forbidden words. The first 
variation, described in Section 4, also takes into account how many times each letter
appears in any given 'good' word, by adding extra variables into the weight func- 
tion. One possible use of this is to substitute probabilities for these variables, thus 
giving a generating function which takes into account a probability distribution on 
the alphabet. Several other variations mentioned later also take into account double 
letter, or pairwise weights. These are variables corresponding to each ordered pair 
of letters in the alphabet. This allows tracking of which consecutive letter com- 
binations occur, and can also allow for double letter probabilities to be filled in. 
Similarly, Section 6 has variables in the weight function corresponding to ordered
triples of letters. In the sections that follow, we describe these modifications to the
basic Goulden-Jackson cluster method in detail.

4. Single Letter Weights 

In order to keep track not only of how many words of a certain length avoid 
certain subwords, but also which letters these words contain, we will redo the basic 
cluster method, using a different weight enumerator. The variation discussed in 
this section was initially described in [NZ99]. The weight of a marked word (w; S)
(where $w = w_1 w_2 \cdots w_k$) will be

$(-1)^{|S|} t^{|w|} x_{w_1} x_{w_2} \cdots x_{w_k}$.

For example, $\mathrm{weight}(abcab; \{\}) = t^5 (x_a)^2 (x_b)^2 x_c$.

As in the original application of the Goulden-Jackson method, we use the
decomposition $M = \{\mathrm{empty\_word}\} \cup AM \cup CM$. This leads to the following recursive
formula for weight(M):

$\mathrm{weight}(M) = 1 + t \Big( \sum_{a \in A} x_a \Big) \mathrm{weight}(M) + \mathrm{weight}(C)\,\mathrm{weight}(M)$.

Note that, since we are keeping track of letter weights, the second term records not
just how many letters are in A, but exactly which ones appear. Simplifying, we get

(7)    $\mathrm{weight}(M) = \frac{1}{1 - t \sum_{a \in A} x_a - \mathrm{weight}(C)}$.

All that remains is to solve for the cluster generating functions, weight(C). We
do this exactly as in the original Goulden-Jackson cluster method, with the same
decomposition: $\mathrm{weight}(C) = \sum_{v \in B} \mathrm{weight}(C[v])$. We will write the same system of
linear equations as before, except that the weight function is different. In particular,
we still have

$\mathrm{weight}(C[v]) = \mathrm{weight}((v; \{v\})) - \sum_{u \in B} \Big( \mathrm{weight}(C[u]) \cdot \sum_{r \in O(v,u)} \mathrm{weight}(v \backslash r) \Big)$

for all $v \in B$, which becomes

$\mathrm{weight}(C[v]) = -t^{|v|} x_{v_1} \cdots x_{v_{|v|}} - \sum_{u \in B} \Big( \mathrm{weight}(C[u]) \cdot \sum_{r \in O(v,u)} t^{|v \backslash r|} x_{v_1} \cdots x_{v_{|v \backslash r|}} \Big)$.

Solving for weight(C[v]) for each forbidden word v and substituting back into
equation (7) yields the desired generating function.

Example 1. Find the generating function of all words in the alphabet {a, b} avoid- 
ing the forbidden words abb and ba. 

We will find the generating function in two ways: (1) using the cluster method
described in this section, and (2) using the straightforward recursive approach from
Section 2.

(1) We have: 

$\mathrm{weight}(C[abb]) = -t^3 x_a x_b^2 - \mathrm{weight}(C[ba])\, t^2 x_a x_b$
$\mathrm{weight}(C[ba]) = -t^2 x_b x_a - \mathrm{weight}(C[abb])\, t x_b$

from which

$\mathrm{weight}(C[abb]) = \frac{-t^3 x_a x_b^2 + t^4 x_a^2 x_b^2}{1 - t^3 x_a x_b^2}$

$\mathrm{weight}(C[ba]) = -t^2 x_b x_a + \frac{t^4 x_a x_b^3 - t^5 x_a^2 x_b^3}{1 - t^3 x_a x_b^2}$

and therefore $\mathrm{weight}(C) = \frac{-t^2 x_a x_b - t^3 x_a x_b^2 + t^4 x_a^2 x_b^2 + t^4 x_a x_b^3}{1 - t^3 x_a x_b^2}$.
Substituting this into equation (7) yields the desired generating function:

$\mathrm{weight}(M) = \frac{1 - t^3 x_a x_b^2}{1 - t x_a - t x_b + t^2 x_a x_b}$.

Taking the first few terms of the Taylor expansion of this generating function
yields:

$1 + (x_a + x_b)t + (x_a x_b + x_a^2 + x_b^2)t^2 + (x_a^2 x_b + x_a^3 + x_b^3)t^3 + (x_a^3 x_b + x_a^4 + x_b^4)t^4 + O(t^5)$

The constant term 1 corresponds to the empty word. The coefficients of
powers of t correspond to the allowable words. For example, the coefficient
of $t^3$ corresponds to the permissible 3-letter words aab, aaa, and bbb.
(2) Returning to the notation of Section 2, we need to take the weight of
$[B, *, \{\}]$. Recall the following set decompositions:

$[B, *, \{\}] \leftrightarrow \{\mathrm{empty\_word}\} \cup [B, a, \{\}] \cup [B, b, \{\}]$

$[B, a, \{\}] \leftrightarrow \{a\} \cup [B, a, \{\}] \cup [B, b, \{bb\}]$

$[B, b, \{\}] \leftrightarrow \{b\} \cup [B, a, \{a\}] \cup [B, b, \{\}]$
$[B, b, \{bb\}] \leftrightarrow \{b\} \cup [B, a, \{a\}] \cup [B, b, \{b\}]$
$[B, a, \{a\}] \leftrightarrow \emptyset$
$[B, b, \{b\}] \leftrightarrow \emptyset$.

It remains to take weights of all the sets listed, from the bottom up, and 
solve for the unknown weights. We must be careful, however, to distinguish 
between identical sets on the left hand sides and right hand sides of the 
correspondence relations. For example, consider the correspondence 

$[B, a, \{\}] \leftrightarrow \{a\} \cup [B, a, \{\}] \cup [B, b, \{bb\}]$.

The weight of the left hand side is simply weight ([B, a, {}]), while the latter 
two sets on the right hand side are assumed to have had their initial letter 
a removed. Thus, the total weight of the right hand side is weight(a) + 
weight(a) · weight([B, a, {}]) + weight(a) · weight([B, b, {bb}]). Solving from
the bottom up, we find: 

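$\mathrm{weight}([B, b, \{b\}]) = 0$

$\mathrm{weight}([B, a, \{a\}]) = 0$

$\mathrm{weight}([B, b, \{bb\}]) = \mathrm{weight}(b)$

$\mathrm{weight}([B, b, \{\}]) = \mathrm{weight}(b) + \mathrm{weight}(b)\,\mathrm{weight}([B, b, \{\}])$

$\mathrm{weight}([B, a, \{\}]) = \mathrm{weight}(a) + \mathrm{weight}(a)\,\mathrm{weight}([B, a, \{\}]) + \mathrm{weight}(a)\,\mathrm{weight}([B, b, \{bb\}])$

$\mathrm{weight}([B, *, \{\}]) = \mathrm{weight}(\mathrm{empty\_word}) + \mathrm{weight}([B, a, \{\}]) + \mathrm{weight}([B, b, \{\}])$.

Solving for each left hand side quantity and substituting into the last
equation, which is the equation for weight(M), we find:

$\mathrm{weight}(M) = 1 + \frac{t x_a + t^2 x_a x_b}{1 - t x_a} + \frac{t x_b}{1 - t x_b} = \frac{1 - t^3 x_a x_b^2}{1 - t x_a - t x_b + t^2 x_a x_b}$.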

□ 

We have implemented this modification of the original Goulden-Jackson cluster 
method, and the code is available in our accompanying Maple package under the 
function name SingleGJ. 
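The Taylor coefficients in Example 1 can be cross-checked by brute force. The sketch below does not call SingleGJ; it enumerates the allowable words of length n directly and records, for each, the exponents of the letter variables. The function name letter_weight_coefficient is hypothetical.

```python
from collections import Counter
from itertools import product

def letter_weight_coefficient(alphabet, bad, n):
    """Coefficient of t^n as a map: exponent tuple (one entry per letter) -> multiplicity."""
    coeff = Counter()
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        if any(b in word for b in bad):
            continue  # the word contains a forbidden factor
        exponents = tuple(word.count(a) for a in alphabet)
        coeff[exponents] += 1
    return coeff

# Exponents are listed in the order (x_a, x_b).
coeff3 = letter_weight_coefficient("ab", ["abb", "ba"], 3)
print(dict(coeff3))
```

For n = 3 the keys (3, 0), (2, 1) and (0, 3) each appear once, corresponding to the monomials $x_a^3$, $x_a^2 x_b$ and $x_b^3$ in the expansion above.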

5. Double Letter Weights 

Sometimes we would like to keep track not just of how many times each letter 
appears in a word, but also which consecutive letter pairs appear. This could be 
relevant, for example, if studying English words, when the pair 'QU' is many times 
more likely to appear than the pair 'QB'. In order to keep track of such data, we 
introduce double letter weights, that is, variables which represent the occurrence of
consecutive letter pairs. 

To include double letter weights, the weight of a marked word (w; S), where
$w = w_1 w_2 w_3 \cdots w_k$, will now be

$(-1)^{|S|} t^{|w|} (x_{w_1} x_{w_2} \cdots x_{w_k})(x_{w_1,w_2} x_{w_2,w_3} \cdots x_{w_{k-1},w_k})$.

We will denote this new weight function W((w; S)). For example,
$W((cat; \{\})) = t^3 (x_c x_a x_t)(x_{c,a} x_{a,t})$. This new weight function does not have all of the nice
properties of weight functions we have seen in the earlier methods. In particular,
concatenation of words no longer corresponds to a simple multiplication of weights.
To see why this is true, consider the word abab as the result of concatenating u = ab
with itself. In this case, $W((uu; \{\}))$ does not equal $W((u; \{\}))^2$. We have
$W((ab; \{\})) = t^2 (x_a x_b)(x_{a,b})$ and so $W((ab; \{\}))^2 = t^4 (x_a)^2 (x_b)^2 (x_{a,b})^2$, while
$W((abab; \{\})) = t^4 (x_a)^2 (x_b)^2 (x_{a,b})^2 x_{b,a}$.



In general, whenever we concatenate two strings we need to account for the 
double letter weight that crosses from one string to the next. We call this extra
factor the transition weight. The original cluster method involves decomposing M,
then using the fact that a disjoint union of sets corresponds to addition of weight 
functions, and concatenation corresponds to multiplication. We can still use this 
basic principle, but we must be more careful with concatenation. In particular, 
whenever we concatenate strings we will need to know the last letter of the first 
string and the first letter of the second string, in order to be able to multiply by 
the appropriate transition weight. This forces us to change how M is decomposed. 
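The failure of multiplicativity, and the transition weight that repairs it, can be seen by representing each weight as a bag of variables. In this sketch (hypothetical names; marked sets are taken to be empty, so no sign is tracked) a monomial is a Counter mapping variable names to exponents, and multiplying monomials adds exponents.

```python
from collections import Counter

def double_weight(word):
    """Exponents of W((word; {})): 't', single letter variables, letter pair variables."""
    mono = Counter({"t": len(word)})
    for letter in word:
        mono["x_" + letter] += 1
    for first, second in zip(word, word[1:]):
        mono["x_" + first + "," + second] += 1
    return mono

def multiply(m1, m2):
    """Product of two monomials: exponents add."""
    return m1 + m2

squared = multiply(double_weight("ab"), double_weight("ab"))
# Variables present in W(abab) beyond those of W(ab)^2:
extra = double_weight("abab") - squared
print(extra)  # Counter({'x_b,a': 1})
```

The single leftover variable is the transition weight $x_{b,a}$, determined by the last letter of the first string and the first letter of the second.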

In the original method, we used the decomposition

$M = \{\mathrm{empty\_word}\} \cup AM \cup CM$.

This involves concatenation in two places: in the second term we concatenate an
arbitrary marked word to a single letter, and in the third term we concatenate an
arbitrary marked word to a cluster. To incorporate the transition weights we will
need to know the first letter of an arbitrary marked word, as well as the last letter
of an arbitrary cluster.

We start by splitting up the set of marked words according to their first letter.
Let $M_a$ be the set of marked words that start with a, $M_b$ be the set of marked
words that start with b, and so on. We have

$M = \{\mathrm{empty\_word}\} \cup \bigcup_{a \in A} M_a$.

To find weight($M_a$), we examine the different types of marked words that can
begin with the letter a. Such a word may be a itself, or we can peel off the initial
a to get a shorter marked word (assuming the initial a is not part of a cluster), or
the word begins with a cluster that begins with a. Let $B_a$ be the set of forbidden
words beginning with a. Accounting for the fact that the entire marked word may
be a cluster, we get the decomposition

(8)    $M_a = a \cup \Big( \bigcup_{b \in A} a M_b \Big) \cup \Big( \bigcup_{v \in B_a} \bigcup_{b \in A} C[v] M_b \Big) \cup \Big( \bigcup_{v \in B_a} C[v] \Big)$.

In this manner we keep track of the first letter of each marked word. It remains to 
address concatenation in the cluster generating functions. 

Cluster Generating Functions. The decomposition of C[v] in the basic cluster 
method is based on the idea that if we have a cluster beginning with a bad word 
v, the cluster is either just that word, or we can peel the first word off and get a 
smaller cluster beginning with a bad word u that has some non-trivial overlap with 
v. Thus we have 

$C[v] = v \cup \Big( \bigcup_{u \in B} \bigcup_{r \in O(v,u)} (v \backslash r)\, C[u] \Big)$.

Since we are computing this for a specific v, we know what the last letter of
$v \backslash r$ will be. Moreover, we know what the first letter of u will be, so the transition
weight is easy to write down. Taking weights on both sides, we get

$W(C[v]) = W((v; \{v\})) - \sum_{u \in B} \sum_{r \in O(v,u)} W(v \backslash r) \underbrace{x_{v_{|v \backslash r|},\, v_{|v \backslash r|+1}}}_{\text{transition weight}} W(C[u])$.

The cluster method is based on computing weights of individual letters and 
clusters, then computing weights of marked words in terms of the letters and clusters 
they contain. However, computing weights of such concatenations requires the 
inclusion of transition weights, since we are incorporating double letter weights into 
our weight-enumerators. In order to do this, it becomes necessary to keep track 
of the last letter of each cluster. When computing cluster weights, we successively 
remove leading forbidden words from a cluster, until we are left with a cluster 
consisting of only one forbidden word. By keeping track of its last letter, we are 
keeping track of the last letter of the original cluster. Thus, it suffices to record the 
last letter of single-word clusters only. We do this by adding a dummy variable to 
the end of each one-word cluster. This dummy variable records the last letter of a 
one-word cluster:

$W(C[v]) := W((v; \{v\})) \underbrace{\mathrm{End}_{v_k}}_{\text{dummy}} - \sum_{u \in B} \sum_{r \in O(v,u)} W(v \backslash r)\, x_{v_{|v \backslash r|},\, v_{|v \backslash r|+1}}\, W(C[u])$

Now we have a system of linear equations with variables W(C[v]), for $v \in B$,
and we can solve for each of these in terms of the dummy variables $\mathrm{End}_a$, where
$a \in A$. However, we don't want our final equation in terms of these variables. When
we prepend a cluster to an arbitrary word beginning with a, and wish to take the
resulting weight, we must first replace all occurrences of $\mathrm{End}_b$ (for any letter b) with
the transition weight $x_{b,a}$. This gives us a way to calculate the transition weight
directly from the cluster generating function, allowing us to use equation (8). We
cannot write down exactly what the transition weight will be in general, since it will
depend on which cluster we are looking at, but if we let $T_C$ denote the transition
weight calculated for a specific cluster, we have

$W(M_a) = t x_a + \sum_{b \in A} t x_a x_{a,b} W(M_b) + \Big( \sum_{v \in B_a} \sum_{c \in A} W(C[v])\, T_C\, W(M_c) \Big) + \sum_{v \in B_a} W(C[v])$.

Now that we can calculate the $W(M_a)$, we can put them together to find W(M).
The Maple code for the cluster method, implementing the weight enumerator de- 
scribed in this section, can be found in our accompanying Maple package under the 
function name DoubleGJ. 

Example 2. We return to the setup from Example 1, redoing the example with
double letter weights. Recall the problem: find the generating function of all words
in the alphabet {a, b} avoiding the forbidden words abb and ba.

Running DoubleGJ to get the generating function and doing a Taylor expansion,
we see the first several terms are:

$1 + (x_b + x_a)t + (x_a x_{a,b} x_b + x_a^2 x_{a,a} + x_b^2 x_{b,b})t^2 + (x_a^2 x_{a,a} x_{a,b} x_b + x_a^3 x_{a,a}^2 + x_{b,b}^2 x_b^3)t^3$
$+ (x_a^3 x_{a,a}^2 x_{a,b} x_b + x_a^4 x_{a,a}^3 + x_{b,b}^3 x_b^4)t^4 + (x_a^4 x_{a,a}^3 x_{a,b} x_b + x_a^5 x_{a,a}^4 + x_{b,b}^4 x_b^5)t^5 + O(t^6)$.

The coefficient of $t^4$ has terms corresponding to the allowable four-letter words
aaab, aaaa, and bbbb. □ 

Remark 3 (Comparison of straightforward recursive approach and Goulden-Jackson
cluster method). The straightforward recursive approach will usually require a
system of at least $|A| + |B|$ equations and unknowns, often more (where A is the
alphabet and B is the set of forbidden words). The Goulden-Jackson method with
double letter weights first solves a system of size $|B|$ (the cluster generating
functions), and then a system of size $|A|$. Even if the number of equations is roughly
the same, by breaking things apart a bit we would still expect the Goulden-Jackson
method to be slightly faster. In practice, however, the differences seem to be small
for small examples. In general, the best approach depends on the situation. If there
are many forbidden words with a lot of overlap, then Goulden-Jackson may take
longer to compute the cluster generating functions, making it slower. However, if
there are fewer, but longer, forbidden words, the straightforward recursive approach
may require many more than $|A| + |B|$ equations and therefore take longer.

6. Triple Letter Weights 

As a further generalization of the original cluster method, in this section we will 
keep track of the occurrences of each letter in a word, the occurrences of consecutive 
letter pairs, and the occurrences of consecutive letter triples. We will use a new 
weight function, $W'$, that accounts for all single letters, letter pairs and letter triples
in a word. For example,

$W'((abcabc; \{\})) = t^6 (x_a^2 x_b^2 x_c^2)(x_{a,b}^2 x_{b,c}^2 x_{c,a})(x_{a,b,c}^2 x_{b,c,a} x_{c,a,b})$.

Note that, as in Section 5, we do not have $W'((uv; \{\})) = W'((u; \{\})) \cdot W'((v; \{\}))$.
In fact in the example above we can see that $W'((abcabc; \{\}))$ has three extra terms
that do not appear in $W'((abc; \{\}))^2$. The term $x_{c,a}$ is from the double letter transition
weight, as described in Section 5. The other two extra terms, $x_{b,c,a}$ and $x_{c,a,b}$,
correspond to the triples that cross between the two factors. These are also considered
transition weights, and our main problem in this section will be modifying our
methods so that it is possible to figure out exactly what these transition weights
will be.

For considering double letter weights but not triple letter weights, it is necessary
to know the last letter of the first string and the first letter of the second string when
concatenating, in order to write down the appropriate transition weight. Now that
we are also considering letter triples, we need to know the last two letters of the first
string and the first two letters of the second string in order to get both triple transition
weights. For example, to concatenate abcd and klmn, the transition weights will
come from the strings cdk, dkl, and dk, so in order to find these weights we need
the final two letters of abcd, as well as the first two letters of klmn.
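The strings that give rise to transition weights are exactly the factors of the concatenation that use letters from both sides of the join. A minimal sketch (the function name transition_factors and its `order` parameter are hypothetical; order = 2 corresponds to double letter weights and order = 3 to triple letter weights):

```python
def transition_factors(first, second, order):
    """Factors of length 2..order of first+second that straddle the join."""
    word = first + second
    cut = len(first)  # index of the first letter of `second`
    factors = []
    for length in range(2, order + 1):
        # A factor straddles the join iff start < cut < start + length.
        for start in range(max(0, cut - length + 1), cut):
            end = start + length
            if end <= len(word):
                factors.append(word[start:end])
    return factors

print(transition_factors("abcd", "klmn", 3))  # ['dk', 'cdk', 'dkl']
print(transition_factors("abc", "abc", 3))    # ['ca', 'bca', 'cab']
```

The second call recovers the three extra variables $x_{c,a}$, $x_{b,c,a}$ and $x_{c,a,b}$ from the abcabc example at the start of this section.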

Recall that in our original setup, we decomposed the set of marked words M as 
follows: 

$M = \{\mathrm{empty\_word}\} \cup AM \cup CM$.

In this decomposition, we append marked words to individual letters, and marked 
words to clusters. In order to include triple letter weights we will need to know the 
first two letters of an arbitrary marked word. We will also need the last two letters 
of an arbitrary cluster. 

In order to keep track of the first two letters of an arbitrary marked word, let
$M_{ab}$ be the set of marked words beginning ab, and decompose M as follows:

$M = \{\mathrm{empty\_word}\} \cup S \cup \Big( \bigcup_{a,b \in A} M_{ab} \Big)$.

Here S is the set of one-letter marked words. Taking weights gives us

(9)    $W'(M) = 1 + \sum_{a \in A} W'((a; \{\})) + \sum_{a \in A} \sum_{b \in A} W'(M_{ab})$.

To solve for $W'(M_{ab})$, we decompose $M_{ab}$ further. A marked word in $M_{ab}$ could
be the two-letter marked word (ab; {}), or the single letter a followed by an arbitrary
marked word beginning with b, or it could consist of a cluster beginning with ab,
followed by an arbitrary marked word. Let $B_{ab} \subseteq B$ be the set of all forbidden
words that begin with ab. We may assume that there are no one-letter forbidden
words (in that case we would simply remove that letter from the alphabet). Thus
B is completely partitioned into the $B_{ab}$. We have

$M_{ab} = \{ab\} \cup \Big( \bigcup_{c \in A} a M_{bc} \Big) \cup \Big( \bigcup_{v \in B_{ab}} C[v] M \Big)$.

By taking weights on both sides, using the substitution given in equation (9), and
adding in the appropriate transition weights when we can, we get

(10)    $W'(M_{ab}) = W'(ab) + \sum_{c \in A} W'(a) \underbrace{x_{a,b}\, x_{a,b,c}}_{\text{transition}} W'(M_{bc}) + \sum_{v \in B_{ab}} W'(C[v])$
        $+ \sum_{v \in B_{ab}} \sum_{c \in A} W'(C[v])\, T_C\, W'(c) + \sum_{v \in B_{ab}} \sum_{c,d \in A} W'(C[v])\, T_C\, W'(M_{cd})$

We aren't yet able to fill in the transition weights $T_C$ when they occur at the
end of a cluster. This is the same problem that occurs in the double letter case,
and the solution is the same. Into the cluster generating function we will put the
dummy variables $\mathrm{End}_{ab}$, for all $a, b \in A$. They will go in the same place $\mathrm{End}_a$ went,
and record the last two letters of a cluster. Then we can solve for the $W'(M_{ab})$ in
terms of the dummy variables and substitute for them when needed.

We have implemented the cluster method incorporating triple letter weights in 
the function TripleGJ. The code can be found in our accompanying Maple package. 

Remark 4. While it is possible to further generalize this method to include weights 
for four-letter strings or higher, we can see even from triple letter weights why this 
might be problematic. For one thing, efficiency suffers greatly. Using only double 
letter weights leads to a system of $|A|$ linear equations, while including triple letter
weights requires $|A|^2$ equations. In general, including weights of strings up to length
k leads to a system of $|A|^{k-1}$ linear equations.

Moreover, small forbidden words become increasingly problematic. In the triple 
letter case, we relied on the fact that having a one-letter forbidden word is equivalent 
to looking at a smaller alphabet. If we wanted to weight four-letter strings, we
would partition the forbidden words based on their first three letters, so two-letter 
forbidden words would be a problem. As we include longer and longer subwords in 
our weight function, we get more and more short forbidden word exceptions. 

7. Further Generalizations 

In this section we introduce further variations of the original Goulden-Jackson 
cluster method. These variations have been implemented in our accompanying 
Maple package, and we refer to the variations according to their corresponding 
function names in our software. 

7.1. ProbDoubleGJ, ProbTripleGJ. ProbDoubleGJ and ProbTripleGJ are variants
of DoubleGJ and TripleGJ, respectively, that are specifically designed for
applications with an underlying Markov chain structure. We can think of the set 
of states in a Markov chain as an alphabet, and a word in that alphabet will corre- 
spond to a history of steps in the Markov process. We move to the next state (or 
equivalently, add the next letter to our word) with some probability that depends 
only on the current state. 

ProbDoubleGJ returns a generating function for subword avoidance in which
each word is weighted with all of the letter pairs it contains, as well as the
single letter weight for the initial single letter only. For example,
$\mathrm{weight}((cat; \{\})) = t^3 (x_c)(x_{c,a}\, x_{a,t})$. We can interpret the double letter
weight as a conditional probability: $x_{a,b}$ is the probability we will move to
state $b$, given that we are currently at state $a$. The initial single letter
probability represents the probability of starting in a given state.

ProbTripleGJ yields a generating function for subword avoidance where each 
word is weighted according to the triple letter strings it contains, as well as its 
initial double letter pair. The weight of a word of length one is the single letter 
weight. For example, 

\[ \mathrm{weight}((abcabc; \{\})) = t^6 (x_{a,b})(x_{a,b,c}^2\, x_{b,c,a}\, x_{c,a,b}). \]

We can interpret the triple letter weight $x_{a,b,c}$ as the conditional probability of
seeing $c$ given that the letters immediately preceding it are $ab$, and the initial
double letter weight as the probability of satisfying a two-state initial condition.
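
The two weightings just described can be illustrated with a short sketch. The Python below is not part of the accompanying Maple package, and the function names `prob_double_factors` and `prob_triple_factors` are hypothetical. For an unmarked word it lists the weight variables that appear in the word's weight, with multiplicity; the power of $t$ is simply the word length and is omitted.

```python
from collections import Counter


def prob_double_factors(word):
    """Weight variables in the ProbDoubleGJ-style weight of an unmarked word."""
    factors = Counter()
    if word:
        factors[(word[0],)] += 1              # initial single letter weight
    for i in range(len(word) - 1):
        factors[(word[i], word[i + 1])] += 1  # one factor per consecutive pair
    return factors


def prob_triple_factors(word):
    """Weight variables in the ProbTripleGJ-style weight of an unmarked word."""
    factors = Counter()
    if len(word) == 1:
        factors[(word[0],)] += 1              # length-one words get the single letter weight
    elif len(word) >= 2:
        factors[(word[0], word[1])] += 1      # initial double letter weight
    for i in range(len(word) - 2):
        factors[tuple(word[i:i + 3])] += 1    # one factor per consecutive triple
    return factors


# "cat" gives x_c, x_{c,a}, x_{a,t}; "abcabc" gives x_{a,b}, x_{a,b,c}^2, x_{b,c,a}, x_{c,a,b}.
print(prob_double_factors("cat"))
print(prob_triple_factors("abcabc"))
```

A key of length one, two, or three stands for a single, double, or triple letter weight variable, and the stored count is its exponent.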
Note that these programs aren't contained in the original double letter weight
and triple letter weight programs, in the sense that we can't get all the results here
by a clever choice of the weights in those programs (described in Sections [5] and [6]).
Our goal here is to selectively weight only the single or double letters that show up 
at the beginning of a word, whereas in the other programs, we assigned a weight 
to every letter in a word, regardless of where it appeared. Nevertheless, with some 
simple modifications to the original programs, we can get the desired results. 

In order to modify our original double letter weight method (implemented in 
the function DoubleGJ), we need only change the weight enumerator used. The 
new weight enumerator will consist of the initial single letter weight multiplied by 
$V(w; S)$, which we define to be the product of all consecutive letter pair weights,
together with a factor of $t$ for each letter. Define $V(a; \{\}) = t$ for all single
letter words $a$. For example, $V(abba; \{\}) = t^4 x_{a,b}\, x_{b,b}\, x_{b,a}$. We first solve
for the $V(\mathcal{M}_a)$ as in Section [5], yielding generating functions
that enumerate all words beginning with $a$, and weighted only by their double
letter components. It remains to multiply by initial letter weights and put things 
back together: 

\[ f(t) = 1 + \sum_{a \in A} x_a V(\mathcal{M}_a). \]

The situation is similar in adapting the triple letter weight method. We define
$V'(w; S)$, which counts triple letter weights, for example
$V'(abab; \{\}) = t^4 x_{a,b,a}\, x_{b,a,b}$, $V'(ab; \{\}) = t^2$, $V'(a; \{\}) = t$.
By solving for the $V'(\mathcal{M}_{ab})$ as in Section [6], we get
generating functions for marked words, weighted only by their triple letter weights.
Multiplying by the initial double letter weight and putting everything together
yields:

\[ f(t) = 1 + \sum_{a \in A} x_a t + \sum_{a,b \in A} x_{a,b}\, V'(\mathcal{M}_{ab}). \]

Example 5. Suppose we are given all initial letter probabilities and pairwise
transition probabilities. For example, let's start with the alphabet $\{a, b\}$, initial letter
probabilities $x_a = 0.75$, $x_b = 0.25$, and double letter probabilities $x_{a,a} = 0.5$,
$x_{a,b} = 0.5$, $x_{b,a} = 0.7$, $x_{b,b} = 0.3$.

Suppose we would like to find the probability of avoiding the forbidden words
bbb and ab. Running ProbDoubleGJ produces the generating function

\[ f(t) = -\frac{1}{100}\, \frac{100t + 200 + 3t^3 + 25t^2}{t - 2}. \]

The first few terms of the Taylor expansion are

\[ 1 + t + \frac{5}{8} t^2 + \frac{131}{400} t^3 + \frac{131}{800} t^4 + O(t^5). \]

The coefficient of $t^2$ is the probability of seeing one of the three allowable two
letter words (aa, ba, or bb). We can calculate this probability by subtracting the 
probability of ab from 1. The probability of ab is the initial probability of a, 0.75, 
multiplied by the digraph probability of ab, 0.5. Given a two letter word, we see 
that the probability of seeing the forbidden word ab is 3/8, thus the probability of 
an allowable two letter word is 5/8. □ 
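
The coefficients in Example 5 can be cross-checked by brute force. The sketch below is plain Python rather than the Maple function ProbDoubleGJ, and it does not use the cluster method at all: it enumerates every word of length n over {a, b}, discards those containing bbb or ab, and sums the Markov-chain probabilities of the rest. The names `word_probability` and `avoidance_probability` are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

ALPHABET = "ab"
INITIAL = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
TRANSITION = {
    ("a", "a"): Fraction(1, 2), ("a", "b"): Fraction(1, 2),
    ("b", "a"): Fraction(7, 10), ("b", "b"): Fraction(3, 10),
}
FORBIDDEN = ("bbb", "ab")


def word_probability(word):
    """Initial letter probability times all pairwise transition probabilities."""
    p = INITIAL[word[0]]
    for x, y in zip(word, word[1:]):
        p *= TRANSITION[(x, y)]
    return p


def avoidance_probability(n):
    """Total probability of the length-n words containing no forbidden factor."""
    if n == 0:
        return Fraction(1)  # the empty word avoids everything
    total = Fraction(0)
    for letters in product(ALPHABET, repeat=n):
        word = "".join(letters)
        if not any(f in word for f in FORBIDDEN):
            total += word_probability(word)
    return total


coefficients = [avoidance_probability(n) for n in range(5)]
print([str(c) for c in coefficients])  # ['1', '1', '5/8', '131/400', '131/800']
```

Exact rational arithmetic is used so that the results can be compared directly with the Taylor coefficients; the enumeration is exponential in n and is only meant as a sanity check for small cases.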

Example 6 (Modeling the English language). We will consider a passage of written 
English as a long string over an alphabet of 27 characters: the 26 lowercase letters 
of the alphabet, and a character SP that corresponds to a space. To keep things 
simple, we ignore punctuation and capitalization. In this example we will use the
words character and string when referring to the model and including the character
SP, and letter and word when referring to English. 

To use ProbDoubleGJ, we need to know information about the frequencies of 
single characters and character pairs. Where this information comes from can have 
a huge effect on the accuracy of this model. For details on how we obtained the 
letter frequencies used in this example, we refer the reader to Appendix [A].

In addition to the table of frequencies, we will need a set of forbidden strings. 
For this example, we will use the forbidden string "SP, t, h, e". This corresponds
to typing a word beginning with "the", a common word beginning.

Running ProbDoubleGJ outputs a rational function with degree 29 polynomials 
in both the numerator and the denominator. We will call this function F(x), and 
we list the first several terms of its Taylor expansion: 

\[
\begin{aligned}
F(x) ={}& 1 + x + x^2 + x^3 + 0.9992162308 \cdot x^4 + 0.9992162308 \cdot x^5 + 0.9991288963 \cdot x^6 \\
&+ 0.9990341113 \cdot x^7 + 0.9989403663 \cdot x^8 + 0.9988466147 \cdot x^9 + 0.9987529643 \cdot x^{10} \\
&+ 0.9986592740 \cdot x^{11} + 0.9985656098 \cdot x^{12} + 0.9984719485 \cdot x^{13} + 0.9983782981 \cdot x^{14} \\
&+ 0.9982846557 \cdot x^{15} + 0.9981910224 \cdot x^{16} + 0.9980973977 \cdot x^{17} + 0.9980037819 \cdot x^{18} \\
&+ 0.9979101748 \cdot x^{19} + 0.9978165765 \cdot x^{20} + 0.9977229870 \cdot x^{21} + 0.9976294063 \cdot x^{22} \\
&+ 0.9975358344 \cdot x^{23} + 0.9974422712 \cdot x^{24} + 0.9973487168 \cdot x^{25} + O(x^{26})
\end{aligned}
\]

The coefficient of $x^n$ corresponds to the probability of avoiding "SP, t, h, e" in a
length $n$ string, according to the probability distributions given. It makes sense that
the coefficients of $x$, $x^2$, and $x^3$ are all one, as the forbidden string is four characters
long. Looking forward in the series, the coefficient of $x^{100}$ is 0.9903570875, the
coefficient of $x^{200}$ is 0.9811110978, the coefficient of $x^{300}$ is 0.9719514288, and the
coefficient of $x^{400}$ is 0.9628772746.

This means that if a monkey is banging keys on a typewriter according to the
probability distributions given, even after 100 keystrokes (including the spacebar)
the monkey has only about a 1% chance ($1 - 0.9904 \approx 0.0096$) of typing a word
beginning "the...". This probability climbs to about 3.7%, just under 4%, after 400 keystrokes.

To compare, and to answer the age old question about monkeys being able to 
type Hamlet, according to our model, the probability that a monkey could type "to 
be or not to be" after 300 keystrokes is $5.861724357 \cdot 10^{-21}$. In comparison, if the
monkey were hitting keys at random, with a 1/27 probability of typing each letter
or the space key, then the probability of typing "to be or not to be" in the first
300 keystrokes is $6.62874079 \cdot 10^{-27}$. More examples are available on the website
for this paper, as well as some of the functions used to analyze these long strings 
of data, and the dictionary used in this example. □ 

7.2. DoubleGJIF. The program DoubleGJIF is a generalization of ProbDoubleGJ. 

This function returns the generating function for subword avoidance where each 
word is weighted by all of its digraph weights as well as the initial single letter 
and final single letter weights (the IF stands for Initial-Final). The implementation 
allows for setting different values for the weight of a single letter depending on 
whether it is the first or the last letter in a word. If we set all of the final letter 
probabilities to 1, the program reduces to ProbDoubleGJ. 

There is a way to modify ProbDoubleGJ to obtain DoubleGJIF: simply multiply
by the final letter weights as they occur. Of course, locating the end of a word in
the decomposition is a bit more complicated, but it turns out that there are only a
few places where we need to add terms.

Recall that in ProbDoubleGJ we have the equation 

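\[ f(t) = 1 + \sum_{a \in A} x_a V(\mathcal{M}_a), \]

and adapted from DoubleGJ we have

\[ V(\mathcal{M}_a) = t + \sum_{b \in A} t\, x_{a,b}\, V(\mathcal{M}_b) + \Bigl( \sum_{f \in B_a} \sum_{c \in A} V(C[f])\, T_c\, V(\mathcal{M}_c) \Bigr) + \sum_{f \in B_a} V(C[f]), \]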

where V is the weight enumerator that weights words only with the letter pairs 
they contain. 

Suppose a marked word ends with a single letter (as opposed to a cluster). We 
decompose the word by peeling off letters or clusters from the beginning of the 
word. Once we arrive at the final letter it will seem as if we are looking at a 
marked word consisting of a single letter. Therefore, if we multiply the term in the 
equation above that corresponds to a single letter with the appropriate final letter 
weight, we will have successfully modified the end of every marked word that ends 
in a single letter. 

Similarly, in dealing with the marked words that end in a cluster, we need only 
change the term in the above sum that corresponds to a marked word made of 
one single cluster. In order to implement DoubleGJ we had to locate the end of 
the clusters, and so they are now flagged with the dummy variables $\mathrm{End}_a$, $a \in A$.
Normally, when we concatenate the empty word to a cluster (effectively ending a
word with a cluster), we substitute 1 for the terms $\mathrm{End}_a$, for all $a \in A$. Instead, we
can substitute the final letter probabilities for $\mathrm{End}_a$, and we will have successfully
modified all the marked words that end in a cluster. Since every nonempty marked 
word must end with a single letter or a cluster, we have added in the final letter 
probability to every marked word. 

7.3. DoubleGJst. All of the programs we have discussed so far return a generating 
function in terms of t, but in fact any of the programs can be modified so that they 
return a generating function in two variables, s and t: 

\[ f(s,t) = \sum_{n=0}^{\infty} \sum_{i=0}^{n} a(i,n)\, s^i t^n, \]

where a(i,n) is the number (or weight, depending on the application) of words 
of length n that contain exactly i forbidden subwords (counted with multiplicity). 
For example, if $A = \{a, b\}$ and $B = \{abb, ba\}$, then three out of the sixteen four-letter
words contain no subwords in $B$: aaaa, aaab, and bbbb. Ten of the words
contain exactly one forbidden subword: aaba, aabb, abaa, abab, abbb, baab, baaa, 
bbaa, bbab, and bbba. Finally, three of the words contain exactly two forbidden 
subwords each: baba, babb, abba. Therefore, we have $a(0,4) = 3$, $a(1,4) = 10$, and
$a(2,4) = 3$. Since this accounts for all 16 of the words of length 4, we must
have that $a(3,4) = a(4,4) = 0$. Therefore, the coefficient of $t^4$ in $f(s,t)$ will be
$3 + 10s + 3s^2$. In more complicated programs, $a(i, n)$ would give the weight of these
words, not just the number of them. Not surprisingly, we can get this generating 
function by modifying the weight enumerator in any of the above programs. 
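
The counts in this example are small enough to confirm by direct enumeration. The following Python sketch is an illustration only and is not the method used by DoubleGJst: it counts, for every word of a given length, the occurrences of forbidden factors with multiplicity and tallies how many words have each count. The helper names `count_occurrences` and `occurrence_distribution` are hypothetical.

```python
from collections import Counter
from itertools import product


def count_occurrences(word, forbidden):
    """Number of (start position, forbidden word) pairs occurring in `word`."""
    return sum(
        1
        for f in forbidden
        for i in range(len(word) - len(f) + 1)
        if word.startswith(f, i)
    )


def occurrence_distribution(alphabet, forbidden, n):
    """Map i -> number of length-n words with exactly i forbidden occurrences."""
    dist = Counter()
    for letters in product(alphabet, repeat=n):
        dist[count_occurrences("".join(letters), forbidden)] += 1
    return dist


dist = occurrence_distribution("ab", ["abb", "ba"], 4)
print(sorted(dist.items()))  # [(0, 3), (1, 10), (2, 3)]
```

Reading the result as a polynomial in s gives 3 + 10s + 3s^2, the coefficient of t^4 stated above.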
Suppose we have a word $w$ that contains $k$ forbidden subwords. The word $w$ will
be the first argument in $2^k$ marked words, one for each subset of the $k$ forbidden
subwords. In the original Goulden-Jackson method, we multiplied the weight of
the letters in $w$ by $(-1)^{|S|}$, so that the number of times a word with $k$ forbidden
subwords will be counted is

\[ \sum_{i=0}^{k} \binom{k}{i} (-1)^i = (1 + (-1))^k = 0^k, \]

which is 0 if $k > 0$ and 1 if $k = 0$. If we would like to keep track of the number
of forbidden subwords a word contains, we can simply replace the $(-1)$ in the
weight function with $(s - 1)$. Thus the weight of a marked word $(w; S)$, where
$w = w_1 w_2 w_3 \cdots w_n$, will be

\[ (s-1)^{|S|}\, t^n\, (x_{w_1} \cdots x_{w_n}) (x_{w_1,w_2}\, x_{w_2,w_3} \cdots x_{w_{n-1},w_n}). \]

Under this new weight function, the number of times a word containing $k$ forbidden
subwords will be counted is

\[ \sum_{i=0}^{k} \binom{k}{i} (s-1)^i = (1 + (s-1))^k = s^k, \]

as desired. The notion of including an extra variable to count the number of 
forbidden word occurrences was introduced in [NZ99]. The functions SingleGJst,
DoubleGJst, and ProbDoubleGJst in our accompanying Maple package are the 
respective analogues of SingleGJ, DoubleGJ, and ProbDoubleGJ incorporating the 
variable s. 
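
As a small worked instance of this count, take the word abba from the example with $B = \{abb, ba\}$, which contains $k = 2$ forbidden subwords. It underlies four marked words, one for each subset $S$ of its two forbidden occurrences, and their $s$-dependent factors add up as follows:

```latex
\[
\underbrace{1}_{S = \{\}}
+ \underbrace{(s-1)}_{S = \{abb\}}
+ \underbrace{(s-1)}_{S = \{ba\}}
+ \underbrace{(s-1)^2}_{S = \{abb,\, ba\}}
= \bigl(1 + (s-1)\bigr)^2 = s^2 .
\]
```

Setting $s = 0$ gives $0^2 = 0$, so abba drops out of the count of avoiding words, while for general $s$ the word contributes $s^2$, which is one of the three words making up the $3s^2$ term in the coefficient of $t^4$.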
Appendix A. Data for Example [6]

In order to obtain frequencies for single letter occurrences as well as pairwise 
letter frequencies, we analyzed a list of 20,422 distinct English words that we
will refer to as the dictionary. We created a large transition probability matrix 
by taking the frequency of a character pair and then dividing by the number of 
occurrences of the initial character. For example, the following table corresponds 
to the probabilities that any of the 27 characters should follow a: 
pr(a,a) = 0.00037487, pr(a,b) = 0.044235,   pr(a,c) = 0.059454,
pr(a,d) = 0.042885,   pr(a,e) = 0.0030739,  pr(a,f) = 0.010046,
pr(a,g) = 0.033138,   pr(a,h) = 0.0039736,  pr(a,i) = 0.029090,
pr(a,j) = 0.00067476, pr(a,k) = 0.011771,   pr(a,l) = 0.11553,
pr(a,m) = 0.039061,   pr(a,n) = 0.14342,    pr(a,o) = 0.00074974,
pr(a,p) = 0.034638,   pr(a,q) = 0.00089969, pr(a,r) = 0.12003,
pr(a,s) = 0.052856,   pr(a,t) = 0.14530,    pr(a,u) = 0.019793,
pr(a,v) = 0.014020,   pr(a,w) = 0.0099715,  pr(a,x) = 0.0044235,
pr(a,y) = 0.015070,   pr(a,z) = 0.0044235,  pr(a,SP) = 0.041086.
The last value, pr(a, SP), was computed not from a letter pair in English, but 
from the frequency of words in English that end with a. The overall sum is 1, since 
every occurrence of the letter a is either followed by another letter, or occurs at the 
end of a word. 

Relatively speaking, the probabilities for the character a are fairly evenly dis- 
tributed. To compare, the corresponding values for q look quite different: 

pr(q,a) = 0.0, pr(q,b) = 0.0, pr(q,c) = 0.0,
pr(q,d) = 0.0, pr(q,e) = 0.0, pr(q,f) = 0.0,
pr(q,g) = 0.0, pr(q,h) = 0.0, pr(q,i) = 0.0,
pr(q,j) = 0.0, pr(q,k) = 0.0, pr(q,l) = 0.0,
pr(q,m) = 0.0, pr(q,n) = 0.0, pr(q,o) = 0.0,
pr(q,p) = 0.0, pr(q,q) = 0.0, pr(q,r) = 0.0,
pr(q,s) = 0.0, pr(q,t) = 0.0, pr(q,u) = 0.99708,
pr(q,v) = 0.0, pr(q,w) = 0.0, pr(q,x) = 0.0,
pr(q,y) = 0.0, pr(q,z) = 0.0, pr(q,SP) = 0.0029240.
Of the 342 occurrences of the letter q in our dictionary, 341 of them are followed
by the letter u, and exactly one of them is at the end of the word. Scrabble fanatics 
will no doubt appreciate that our dictionary is incomplete. We further remark that 
our model ignores context, and the fact that some words are more common than 
others in written English. 

The last row of the probability matrix will be the probabilities pr(SP, a), pr(SP, b),
etc. These are the initial letter probabilities, in other words, the probability a word
begins with a particular letter. As a default, we set pr(SP, SP) = 0, ensuring that
between two words there will only be one space. Calculating the single character 
frequencies is straightforward for the characters that are letters. We set the fre- 
quency of SP according to the number of words in the dictionary, with the idea 
that between each word there must be exactly one space. 
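
One plausible reading of this construction can be sketched in a few lines of Python. This is not the code used to produce the tables above, and the function name `transition_probabilities` and the three-word toy dictionary are assumptions made purely for illustration. Each dictionary word is padded with the character SP on both sides, so that pairs (SP, letter) record word beginnings and pairs (letter, SP) record word endings; each pair count is then divided by the number of occurrences of its first character.

```python
from collections import Counter, defaultdict
from fractions import Fraction

SPACE = "SP"


def transition_probabilities(words):
    """Return pr[x][y]: the fraction of occurrences of x that are followed by y."""
    pair_counts = defaultdict(Counter)
    for word in words:
        chars = [SPACE] + list(word) + [SPACE]  # one space before and after each word
        for x, y in zip(chars, chars[1:]):
            pair_counts[x][y] += 1
    probs = {}
    for x, followers in pair_counts.items():
        total = sum(followers.values())         # occurrences of x as a first character
        probs[x] = {y: Fraction(count, total) for y, count in followers.items()}
    return probs


# Toy dictionary (hypothetical): q is followed by u twice and ends a word once.
probs = transition_probabilities(["quit", "aqua", "iraq"])
print(probs["q"])
```

As long as the dictionary contains no empty words, the pair (SP, SP) never occurs, which matches the default pr(SP, SP) = 0, and every row sums to 1 for the same reason given for the character a above.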